1  Introduction

1.1  Contractual  matters 

1.1.1  This report has been issued by Dr. P. Nowell for Rome Laboratory under contract
CSM/6694.

1.1.2  The  report  is  split  into  two  parts  which  mirror  the  major  components  of  the  tactical 
language  identification  (TLID)  software.  The  first  part  covers  the  vector  quantiser  software 
whereas  the  second  part  covers  the  sequence  analysis  software.  Further  details  of  the 
algorithms  used  and  their  application  to  the  problem  of  tactical  language  identification  can 
be  found  in  the  report  ‘Dr  P.  Nowell  and  Dr  D.  A.  Stevens,  Final  Report  on  Tactical 
Language Identification (U), DERA/CIS/CIS5/CR/97472A/1.0, Dec. 1997’.

1.1.3  The  manual  briefly  outlines  the  external  data  requirements,  gives  instructions  for  executing 
the  top  level  scripts  and  lists  the  major  control  parameters.  Further  usage  instructions  are 
given  for  the  major  scripts  and  executables.  Section  two  contains  the  user  manual  for  the 
vector  quantiser  component  of  the  tactical  language  identification  software.  The  user 
manual  for  the  sequence  analysis  is  contained  in  section  three  and  follows  the  same  format. 

1.1.4  Section four provides additional information that may be useful in adapting the scripts to
new  data  and/or  languages.  Finally,  section  5  describes  the  file  structure  for  the  installed 
software. 
2  Vector Quantiser User Manual


2.1  Introduction 

2.1.1  This  section  details  the  use  of  the  vector  quantiser  (VQ)  based  TLID  software.  Details 
include  descriptions  of  how  to  use  the  three  scripts  for  training  and  testing  of  the  data  as 
well  as  for  sequence  generation  as  required  by  the  sequence  analysis  software  (see  section 
3).  The  guide  also  includes  details  of  the  executables  that  are  at  the  core  of  all  the  VQ 
scripts. 


2.2  External  data  requirements 

2.2.1  The external data requirements for the VQ component of the TLID software are listed
below.

1.  Speechlist files               (pairwise list of speech data and annotation files)
2.  Speech data files              (as listed in the speechlist file)
3.  Annotation files               (as listed in the speechlist file)
4.  Pre-processor definition file  (e.g. ppfiles/mfc16+dc+Dc5.pp)


2.2.2  The  speechlist  files  are  text  files  containing  a  pairwise  list  of  the  speech  data  and 
annotation  files  that  are  used  for  training  or  testing. 

2.2.3  The speech data files are binary files containing a spectrogram type representation of the
speech data. The first 512 bytes of each file constitute a header (see appendix A) which
describes, amongst other things, the frame rate and vector size.
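
The fixed-length header described above can be separated from the frame data as in the following minimal sketch. It assumes only the 512 byte header length stated in the text; the layout of the header fields is given in appendix A and is not decoded here. The function names are hypothetical and are not part of the TLID software.

```python
HEADER_BYTES = 512  # header length stated in section 2.2.3


def split_speech_file(raw: bytes) -> tuple[bytes, bytes]:
    """Split the raw contents of a speech data file into (header, payload).

    The header is returned undecoded because its field layout is defined
    in appendix A, which this sketch does not assume anything about.
    """
    if len(raw) < HEADER_BYTES:
        raise ValueError("file is shorter than the 512 byte header")
    return raw[:HEADER_BYTES], raw[HEADER_BYTES:]


def read_speech_file(path: str) -> tuple[bytes, bytes]:
    """Read a speech data file from disk and split off its header."""
    with open(path, "rb") as handle:
        return split_speech_file(handle.read())
```

The annotation files are described as having a similar format, so the same split may apply to them, although that should be checked against appendix A.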

2.2.4  The  annotation  files  are  also  binary  files  and  have  a  similar  format  to  the  speech  data  files 
(see  appendix  A).  Annotation  files  are  used  to  specify  sub-regions  within  each  speech  file 
that are to be used for training or testing.


2.3  Executing  the  scripts 

2.3.1  The training process is controlled by a single script (VQtrain.perl) which is used
repeatedly to generate a codebook model for each target language.

2.3.2  Once  the  language  specific  codebooks  have  been  generated  the  VQ  language  classifier  is 
tested  using  the  script  VQtest.perl.  The  program  generates  an  output  file  containing 
the  scores  assigned  to  each  test  utterance  by  each  codebook. 

2.3.3  A  third  script  (GenSeq.perl)  is  used  to  generate  the  sequence  files  used  by  the  sequence 
analysis  software  (see  section  3).  The  script  uses  the  codebooks  generated  by 
VQtrain.perl to convert a series of speech data files into sequence files.
2.4  Changing the control parameters

2.4.1  A number of control parameters are embedded within the training and testing scripts as
described below. In addition the two main executables (CB3 and MCB, see sections 2.8
and  2.9  for  details)  have  a  number  of  parameters  that  can  be  adjusted  to  change  the  training 
and  testing  of  the  vector  codebook  models. 

2.4.2  The major parameters for the training script VQtrain.perl are as follows:

$MinBkSize      Minimum codebook size to generate
$MaxBkSize      Maximum codebook size to generate
$TheTrainProg   Full pathname of cb3 executable
$TheLanguage    Unique language identifier
$exptDir        Main root directory for current experiment
$TheSpeechList  Full pathname of the speechlist file
$ThePreProc     Full pathname of the pre-processor definition file


2.4.3  The parameters $MinBkSize and $MaxBkSize are self explanatory in that they are the
lower and upper limits of the codebook sizes to generate while training. The main root
directory $exptDir is the directory where the scripts are held, together with all the
sub-directories containing the codebook model files. For any set of experiments where training
and testing are used, $exptDir will generally remain the same for all training and
testing scripts. The $TheTrainProg, $TheSpeechList and $ThePreProc parameters are
described in the section on external data requirements (section 2.2).

2.4.4  The test script VQtest.perl has a similar set of parameters:

$TheTestProg  Full  pathname  of  mcb  executable 

$exptDir  Main  root  directory  for  current  experiment 

$TheSpeechList  Full  pathname  of  the  speechlist  file 

$ThePreProc  Full  pathname  of  the  pre-processor  definition  file 

$CBList  Full  pathname  of  the  codebook  list  file. 


2.4.5  The  $exptDir  and  $ThePreProc  parameters  listed  above  should  be  identical  to  those 
listed  in  the  VQtrain.perl  script.  The  $TheSpeechList  and  $CBList  parameters 
should  point  to  the  lists  which  identify  which  files  to  test  and  which  codebook  models  to 
test  them  with. 
2.5  VQtrain.perl

For each language a separate version of this script is required. In general all the parameters will
remain identical except for the $TheLanguage setting within the script. It is common practice to
generate several scripts with the name VQtrain_${TheLanguage}.perl to identify which
script has generated which language model. For example the script for generating the English model
is called VQtrain_english.perl.

Usage: 

VQtrain_${TheLanguage}.perl

Input  File(s): 

The  script  requires  that  the  speechlist,  speech  data  files,  annotation  files  and  pre-processor  file  have 
already  been  generated  (see  section  2.2). 

Intermediate  File(s): 

None 

Output  File(s): 

The script generates separate codebook files in the directory codebooks/$TheLanguage for
each codebook in the range $MinBkSize to $MaxBkSize. The files are named

$exptDir/codebooks/$TheLanguage/$Language.bk.$size

where $size is the zero padded number of states in the codebook. Consequently for a script which
trains the English language with $MinBkSize set to 4 and $MaxBkSize set to 16 there are 3 output
files, namely: english.bk.0004, english.bk.0008 and english.bk.0016.
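
The naming scheme can be illustrated with a short sketch. It assumes that the codebook size doubles at each step (consistent with the 4, 8, 16 example and with the power of 2 restriction on the CB3 size parameters) and that $size is padded to four digits. The function name is hypothetical.

```python
def codebook_filenames(language: str, min_size: int, max_size: int) -> list[str]:
    """List the codebook file names expected for one language.

    Assumes sizes double from min_size up to max_size and that the
    size suffix is zero padded to four digits, as in english.bk.0016.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    names = []
    size = min_size
    while size <= max_size:
        names.append(f"{language}.bk.{size:04d}")
        size *= 2  # assumed doubling between successive codebooks
    return names
```

For example, codebook_filenames("english", 4, 16) gives the three English file names listed above.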
2.6  VQtest.perl

The test script for the VQ based language identification system requires that the VQtrain.perl
script has been used to generate a set of codebooks for a set of languages.

Usage: 

VQtest.perl

Input  File(s): 

The  test  script  requires  speechlists,  speech  data  files  and  annotation  files  in  the  same  format  as  that 
used  by  the  training  script. 

In  addition  a  codebook  list  file  specifies  a  list  of  codebook  files  and  directories  such  that  each  line  in 
the  codebook  list  becomes  a  full  pathname  description  of  one  of  the  language  specific  codebooks 
when prefixed by the directory codebooks. E.g. for a sixteen element English codebook the entry
would be

english/english.bk.0016

Intermediate  File(s): 

None 

Output  File(s): 

The output file is contained in the $exptDir/results directory and is given the name
results.$size where $size is the number of states used in the codebooks, i.e. the results are
written to

$exptDir/results/results.$size

The  results  file  contains  a  list  with  three  space  separated  columns.  The  columns  are,  in  order;  the 
speech  data  file  used  in  the  test;  the  codebook  file  against  which  it  has  been  tested;  the  score 
obtained with the given speech file and codebook file. Thus for each speech data file listed in the
speechlist file N lines are generated in the results file where N is the number of codebooks listed in the
codebook  list  file.  For  each  test  file  the  N  entries  are  ordered  in  preference  such  that  the  most  likely 
language  is  the  first  of  the  N  entries  and  the  least  likely  codebook  is  the  last  entry  for  that  test  file. 
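
A results file in this format can be read back as in the sketch below, which takes the first entry for each speech file as the classifier's decision. It assumes whitespace separated columns and file names without embedded spaces; the function name is hypothetical and the file names in any example are illustrative only.

```python
from collections import defaultdict


def best_codebooks(lines):
    """Map each speech file to its top ranked codebook.

    Each line holds three columns: speech file, codebook file, score.
    The entries for a speech file are already ordered by preference,
    so the first one seen is taken as the most likely language.
    """
    ranked = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        speech, codebook, score = fields
        ranked[speech].append((codebook, float(score)))
    return {speech: entries[0][0] for speech, entries in ranked.items()}
```

The scores are parsed but not compared, because the ordering in the file already encodes the ranking.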
2.7  GenSeq.perl


GenSeq.perl is used to generate the VQ labelled training and test data for the sequence analysis
component of the TLID system (see section 3).

Usage: 

GenSeq.perl

Input  File(s): 

The script requires speechlists, speech data files and annotation files in the same format as that
used  by  the  training  script. 

In  addition  a  codebook  list  file  specifies  a  list  of  codebook  files  and  directories  such  that  each  line  in 
the  codebook  list  becomes  a  full  pathname  description  of  one  of  the  language  specific  codebooks 
when prefixed by $exptDir/codebooks. E.g. for a sixteen element English codebook the entry
would be

english/english.bk.0016


Intermediate  File(s): 

None 

Output  File(s): 

The  output  from  this  script  is  in  the  form  of  sequence  files,  N  per  input  file  as  listed  in  the  speechlist 
file,  where  N  is  the  number  of  codebooks  listed  in  the  codebook  list  file.  For  each  speech  file  in  the 
speechlist  a  series  of  sequence  files  are  generated  with  the  same  name  as  the  original  speech  file 
except that the filename ends in .seq, i.e.

$exptDir/sequences/<codebook_name>/<speech_filename>.seq

The  sequence  file  is  a  text  file  of  space  separated  state  identifiers  with  a  state  number  output  per 
frame  of  input  data  and  each  state  number  being  preceded  by  the  letter  ‘s’  as  a  further  separator. 
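
The sequence file format can be parsed as in the following sketch, which assumes tokens of the form s<number> separated by whitespace, one token per frame. The function name is hypothetical.

```python
def parse_sequence(text: str) -> list[int]:
    """Convert the text of a sequence file into a list of state numbers.

    Each whitespace separated token is expected to be the letter 's'
    followed by a state number, with one token per frame of input.
    """
    states = []
    for token in text.split():
        if not token.startswith("s"):
            raise ValueError(f"unexpected token: {token!r}")
        states.append(int(token[1:]))
    return states
```

The length of the returned list is the number of frames, which is also the file duration measure used later by the sequence analysis software.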
2.8  CB3 executable

2.8.1  The CB3 executable is designed for both training and testing of a codebook model against
a set of speech files as listed in the speechlist file. The options available upon running the
CB3 program are as follows together with their default values:

train=+          Train (+) or test (-) mode, valid arguments + and -
speechlist=      Speech list file name
preproc=         Pre-processor definition file name
cbfname=         Codebook file name (input if testing, output if training)
distfile=        Output distance filename (debugging)
seqfile=         Output sequence file name descriptor
occ=+            State occupancy flag for calculating distances
minBk=1          Minimum codebook size to generate (power of 2)
maxBk=32         Maximum codebook size to generate (power of 2)
variance=-       Variance terms flag for the codebook model
fullcovar=-      Full covariance flag for the codebook model
globv=-          Global variance flag for the codebook model
min_occ=10       Minimum number of frames per state before splitting
min_var=1e-06    Hard limit on the minimum variance allowable
min_it=10        Minimum number of iterations per codebook size
max_it=30        Maximum number of iterations per codebook size
converge=0.0001  Convergence value
conv_it=3        No. of frames over which convergence is calculated
pmag=1e-06       Perturbation magnification for splitting states
verbose=-        Verbose flag (used for debugging)

2.8.2  Within the scripts which use the CB3 program the majority of the default settings are kept.
However it is possible to alter the scripts in order to allow different codebook structures or
different training conditions. Only one of the three parameters variance, globv and
fullcovar may be set to positive although all of them may be set to negative to obtain a
system which does not use variance terms at all as in the early OGI experiments [Nowell
and Stevens 1997].

2.8.3  The occ= term is used to include the prior probability of a state occurring in the distance
calculation. Any given state will occur a number of times in the training data and as
such it is possible to calculate the likelihood of this state occurring for a given language and
to use it in the distance measure.

2.8.4  For the seqfile= parameter the string sequence %s will be replaced by the input speech
file name header such that by setting seqfile=<dir>/%s.seq it is possible to direct
the program to put all the sequence files in the <dir> directory, with the same name as the
respective speech file in the speech list except for the ending which is replaced by .seq.
2.8.5  The distfile= parameter is identical to the seqfile= parameter except for the format
of the file output. The distance file contains information regarding the distance on a per
frame basis with one frame per line.
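
The %s substitution used by the sequence and distance file parameters can be sketched as follows. It assumes that the "file name header" is the speech file's base name with its directory and extension removed; this is an interpretation of the text rather than a confirmed detail of the program. The function name and the example paths are hypothetical.

```python
import os


def expand_template(template: str, speech_path: str) -> str:
    """Replace %s in an output file descriptor with the speech file stem.

    The stem is taken to be the base name of the speech file without
    its directory or extension (an assumption about the program).
    """
    stem = os.path.splitext(os.path.basename(speech_path))[0]
    return template.replace("%s", stem)
```

Under this assumption a descriptor of seqs/%s.seq applied to /cdrom/dft/en003stb.dft yields seqs/en003stb.seq.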
2.9  MCB executable


2.9.1  While it is possible to use the CB3 program for testing a file against several codebooks it
would be necessary to run the CB3 program separately for each codebook in turn. The MCB
executable is a simple test program which uses many of the same routines as the CB3
program but is able to load multiple codebook files at a time.


cb_dir=       Codebook root directory
cblist=       Codebook list file
speechlist=   Speech list file
preproc=      Pre-processor definition file
results=      Results output file name
verbose=-     Verbose flag (used for debugging)


2.9.2  The filenames listed in the cblist file are appended to the base directory as defined by
the cb_dir parameter to give the full path name for each codebook model required. The
speechlist and preproc parameters are used to indicate which speech list file and
pre-processor file to use as in the CB3 program. The results parameter is the location and
filename of the output results file.

2.9.3  All parameters regarding the codebook structure are unnecessary as they are contained
within the codebook file itself and are set up when the files are read in. This is also true for
the CB3 program when used in test mode.
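
The way codebook list entries are combined with the codebook root directory can be sketched as below. It assumes one entry per line in the codebook list file and forward slash path separators; the function name and the directory in the example are hypothetical.

```python
import posixpath


def codebook_paths(cb_dir: str, cblist_lines) -> list[str]:
    """Join each codebook list entry onto the codebook root directory.

    Blank lines are ignored; every other line is treated as a path
    relative to cb_dir, e.g. english/english.bk.0016.
    """
    return [
        posixpath.join(cb_dir, line.strip())
        for line in cblist_lines
        if line.strip()
    ]
```

With cb_dir set to expt/codebooks, the entry english/english.bk.0016 resolves to expt/codebooks/english/english.bk.0016.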
3  Sequence Analysis User Manual

3.1  Introduction 

3.1.1  This  section  of  the  report  contains  the  user  manual  for  the  sequence  analysis  component  of 
the  tactical  language  identification  (TLID)  system.  Before  any  of  this  software  can  be  used 
it  is  first  necessary  to  build  the  VQ  codebooks  and  then  quantise  the  training  and  test  data. 
Instructions  for  doing  this  are  contained  in  the  user  manual  for  the  vector  quantiser  [section 
2]. 


3.2  External  data  requirements 

3.2.1  The  sequence  analysis  component  of  the  TLID  system  builds  upon  the  output  of  the  vector 
quantiser.  It  is  therefore  necessary  to  first  create  a  suitable  quantiser,  vector  quantise  the 
training  and  test  data  and  then  copy  (or  create  links  to)  this  data  in  the  directory  VQdata. 
The  following  data  needs  to  be  transferred 


1.  Pre-processor definition file  (e.g. ppfiles/mfc16+dc+Dc5.pp)
2.  Vector quantiser codebooks     (e.g. ExptsSc/codebooks/english/english.bk.0128.Z)
3.  Vector quantised sequences     (e.g. sequences/english.bk.0128/en003stb.seq.Z)
4.  Distance matrix                (e.g. english.dmat2.0128.Z)
5.  Training speechlists           (e.g. train_dev_english.tsl)
6.  Test speechlists               (e.g. test45_dev_english.tsl)
7.  Speech data                    (e.g. /cdrom/dft/en003stb.dft)


3.2.2  The  pre-processor  definition  file  is  a  text  file  which  describes  the  various  pre-processing 
stages  that  are  applied  to  the  speech  data.  These  stages  will  typically  include  the  calculation 
of  the  mean  power,  cosine  coefficients  and  deltas  thereof.  For  further  details  of  the  various 
pre-processors  that  have  been  used  and  their  effects  on  the  performance  of  the  vector 
quantiser  see  [Nowell  and  Stevens  1997]. 

3.2.3  The  vector  quantiser  codebooks  are  trained  on  the  parameterised  speech  data.  Each 
codebook  has  a  number  of  entries  (e.g.  128)  with  means  and  variances  representing 
prototypical  speech  vectors.  The  sequence  analysis  software  requires  two  codebooks,  the 
first  larger  codebook  is  used  to  model  the  speech  key-fragments  and  the  second  smaller 
codebook  (in  these  experiments  containing  just  2  entries)  is  used  as  a  babble  model  for  the 
non  key-fragment  speech.  The  codebooks  are  stored  as  compressed  files  in  order  to  save 
disk  space. 

3.2.4  The sequences/english.bk.0128/ directory contains the vector quantised training
data  generated  using  the  larger  of  the  two  English  codebooks.  These  files  will  be  processed 
by  the  sequence  analysis  software  in  order  to  extract  acoustically  similar  sub-sequences 
which  can  then  be  used  by  the  language  classifier. 
3.2.5  The distance matrix file contains pre-computed distances reflecting the acoustic similarity
of  the  codebook  entries  in  relation  to  one  another  and  the  smaller  babble  codebook. 

3.2.6  The  training  and  test  speechlists  contain  a  list  of  the  speech  files  that  should  be  used  for 
training  and  testing.  The  training  data  is  split  into  development  and  evaluation  sub-sets 
with  separate  speechlists  for  each  sub-set  and  language.  Likewise,  the  test  data  is  also  split 
into  language  specific  development  and  evaluation  sub-sets.  In  addition  separate 
speechlists  are  used  for  test  files  either  approximately  45  seconds  or  15  seconds  long. 

3.2.7  Finally  the  raw  speech  data  needs  to  be  on-line.  This  data  can  require  considerable  amounts 
of  disk  storage  space  so,  in  the  case  of  OGI,  the  data  files  for  each  language  are  stored  on 
separate  CD-ROMs  which  are  accessible  over  the  network.  The  text  file 
'files/drives.txt' contains a mapping between the language name and the
machine  which  holds  the  CD-ROM  for  that  language. 

3.3  Executing  the  scripts 

3.3.1  The  training  and  testing  process  is  controlled  by  a  single  top-level  script  (tlid.rcp) 
which  in  turn  calls  a  number  of  sub-scripts  and  executables  as  required  for  training  and 
testing.  To  train  and  test  a  classifier  for  the  target  language  ‘<language>’  this  script  is 
executed  using  simply: 

tlid.rcp  <language> 

3.3.2  This  script  initialises  various  variables  and  then  goes  on  to  call  several  sub-scripts  which  in 
turn  implement  the  various  stages  involved  in  training,  testing  and  scoring  the  language 
classifier. 

3.3.3  During  training  the  following  sub-scripts  are  used 

1.  analyse.rcp 

2.  cluster.rcp 

3.  count.rcp

4.  select.rcp 

3.3.4  and  the  following  are  used  during  testing. 

1.  count.rcp 

2.  classify.rcp 

3.  score.rcp

3.3.5  The first sub-script analyse.rcp is responsible for extracting similar sub-sequences
from the vector quantised training data. These sub-sequences are then clustered by
cluster.rcp which collects together similar sub-sequences representing the same or
similar underlying sounds. Occurrences of each cluster centroid are then counted in the
training data by count.rcp and these counts are used by select.rcp to calculate
‘usefulness’ scores and determine the most discriminative sub-sequences. During testing
occurrences of the most discriminative sub-sequences are counted in the test data again
using count.rcp. The occurrence counts and associated ‘usefulness’ scores of each
fragment are accumulated and the total used by classify.rcp to classify the individual
test files. Finally, the classified test files are scored using score.rcp and the output is
presented as a list of false alarm rates and probabilities of detection for a range of detection
thresholds.
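
The accumulation step used during testing can be illustrated with the sketch below. The manual does not give the exact formula, so the sketch assumes that the total for a test file is the sum over fragments of occurrence count multiplied by usefulness score; any normalisation applied by classify.rcp is not modelled. The function name is hypothetical.

```python
def accumulate_score(counts, usefulness):
    """Combine fragment counts and usefulness scores into one total.

    Assumes a simple weighted sum (count times usefulness per fragment);
    the actual combination used by classify.rcp may differ.
    """
    if len(counts) != len(usefulness):
        raise ValueError("counts and usefulness must have the same length")
    return sum(count * score for count, score in zip(counts, usefulness))
```

Under this assumption a file in which strongly discriminative fragments occur often receives a high total, which is then compared against a detection threshold.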


3.4  Changing  the  control  parameters 

3.4.1  There  are  a  number  of  parameters  which  can  be  changed  in  order  to  modify  the  behaviour 
of  the  sequence  analysis  software.  Most  of  these  are  now  contained  in  a  single  file 
‘scripts/common.args’ which has comments describing their purpose.

3.4.2  At the end of tlid.rcp there is a command

$score  $basedir  $target  -4000  250  1000 

3.4.3  This  command  runs  the  scoring  script  with  a  range  of  threshold  values  ranging  from  -4000 
to  1000  in  increments  of  250.  In  order  to  generate  reasonable  ROC  plots  it  is  necessary  to 
obtain  a  reasonable  number  of  data  points  ranging  from  0%  to  100%  false  alarms  and  true 
detections.  This  can  be  achieved  by  varying  the  threshold  and  increment  values  until  the 
appropriate  points  are  generated.  In  practice  the  training  and  testing  scripts  are  run  through 
to  completion  for  each  language  and  the  command 

scripts/score.rcp . <language> <start> <increment> <end>

3.4.4  is  repeatedly  executed  in  the  experiment  directory. 
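
The threshold sweep can be sketched as follows. The first function reproduces the grid implied by the start, increment and end arguments; the second computes one ROC point per threshold. The decision rule (a file is detected when its score is at or above the threshold) and both function names are assumptions, since the internals of the scoring script are not described here. Both score lists are assumed to be non-empty.

```python
def thresholds(start, increment, end):
    """List threshold values from start to end inclusive."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    values = []
    value = start
    while value <= end:
        values.append(value)
        value += increment
    return values


def roc_points(target_scores, other_scores, grid):
    """Return (threshold, false alarm rate, detection rate) per threshold.

    target_scores are totals for files in the target language and
    other_scores are totals for files in any other language.
    """
    points = []
    for threshold in grid:
        detected = sum(score >= threshold for score in target_scores)
        false_alarms = sum(score >= threshold for score in other_scores)
        points.append(
            (threshold, false_alarms / len(other_scores), detected / len(target_scores))
        )
    return points
```

The arguments -4000 250 1000 give 21 thresholds. If the resulting points do not span the range from 0% to 100%, the start, increment and end values can be adjusted and the sweep repeated.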
3.5  tlid.rcp

A  single  top-level  script  (tlid.rcp)  is  used  to  train  and  evaluate  a  classifier  for  each  language  of 
interest.  This  script  is  called  as 

Usage: 

tlid.rcp  <target>  -  Target  language  (e.g.  English) 

Input  files 

None 

Intermediate  files 

See  description  of  component  shell  scripts 

Output  files 

None 

<target>  is  the  target  language  (e.g.  german,  farsi  etc).  This  script  calls  a  number  of  other  scripts 
in  turn  to  carry  out  the  various  stages  involved  in  training  and  evaluating  the  language  classifier. 
These  stages  and  their  control  scripts  are  described  in  order  of  processing. 
3.6  analyse.rcp


The first sub-script scripts/analyse.rcp is used to analyse the training speech data files
and locate similar sub-sequences (i.e. similar, re-occurring sounds). For the purposes of language
identification one would want these sub-sequences to represent phonemes, word fragments (prefixes
and suffixes) and possibly entire words (yes, ja etc).


Usage: 

analyse.rcp  <target>      -  Target language (e.g. English)
             <codesize>    -  Codebook size (e.g. 0128)
             <trainlist>   -  File containing list of training files
             <similarity>  -  Similarity threshold


Input  File(s): 

The  script  requires  that  the  training  speech  data  has  been  vector  quantised  (i.e.  converted  to  a 
symbol  stream).  The  argument  <trainlist>  gives  the  name  of  the  file  which  contains  the  list  of 
vector  quantised  training  files. 

A  pre-computed  distance  matrix  is  also  required  for  determining  the  self-similar  regions. 

Intermediate File(s):

The processing of each input file leads to the production of an output file (extension .pms)
containing a list of partial matches.

$exptdir/train/$basename.pms

The contents of each file are clustered (using cluster.rcp) in order to group together similar
partial matches; the clusters are stored as intermediate files.

$exptdir/train/$basename.cls

Output  File(s): 

The  output  is  a  set  of  files  containing  a  list  of  similar  sub-sequences  that  were  found  in  the  input  VQ 
sequence  files. 


$exptdir/train/$basename.cen

These  sub-sequences  are  actually  the  centroids  (i.e.  most  representative  examples)  of  the  clustered 
partial  matches. 
3.7  cluster.rcp


The  partial  match  files  generated  by  analyse.rcp  typically  contain  multiple  entries  for  the  same 
underlying  sound  due  to  repetitions  of  the  sound  in  the  training  file(s).  It  is  necessary  to  cluster  these 
multiple  entries  in  order  to  determine  a  single  representative  sequence  for  each  sound. 
Cluster.rcp is called from analyse.rcp to cluster the partial matches within each training file
and then again from tlid.rcp to re-cluster the centroids of the clustered training files.

The  two  stage  clustering  process  is  significantly  quicker  than  simply  concatenating  and  clustering 
the  partial  matches  in  all  the  training  files.  The  end  result  will  be  similar  provided  that  the  same  (or 
similar)  centroids  are  generated  with  sufficient  frequency  in  the  training  data.  This  assumption  will 
almost  certainly  be  true  for  any  fragments  that  are  likely  to  be  ‘useful’  since  these  will  occur  often 
and  in  most  training  files. 


Usage: 

cluster.rcp  <options>     -  Additional options
             <target>      -  Target language (e.g. english)
             <codebook>    -  Codebook size (e.g. 0128)
             <fragments>   -  Input file containing fragments
             <similarity>  -  Similarity threshold
                           -  Minimum cluster size
             <output>      -  Output file

Input File(s):

The input consists of a text file (<fragments>) containing the fragments to be clustered and a
distance matrix ($codebook) containing pre-computed inter-state similarity scores.

Intermediate File(s):

None 

Output  File(s): 

The  output  is  a  text  file  (extension  .cls)  which  contains  the  individual  clusters.  The  first 
fragment  in  each  cluster  is  the  centroid  (i.e.  most  representative  example)  of  that  cluster. 
3.8  count.rcp


Count.rcp is basically a pre-processor for astrec.rcp which actually uses the continuous
speech  recogniser  (which  we  call  astrec)  to  count  the  fragment  occurrences  in  the  training  or  test 
speech. 

This  shell  script  generates  the  necessary  control  files  to  configure  the  speech  recogniser  to  be  an 
acoustic  fragment  spotter.  These  include  single  state  babble  and  fragment  models  derived  from  the 
VQ  codebooks,  a  pronunciation  dictionary  mapping  fragment  strings  to  the  appropriate  sequence  of 
single  state  models  and  syntax  files  allowing  the  recognition  of  acoustic  fragments  interspersed  with 
‘babble’. 

A  detailed  description  of  the  recogniser  configuration  files  is  beyond  the  scope  of  this  report. 


Usage: 

count.rcp  <target>      -  Target language (e.g. english)
           <codebook>    -  Codebook size (e.g. 0128)
           <speechlist>  -  List of VQ'd speech files
           <fragments>   -  Fragments to be counted


Input  File(s): 

The  input  consists  of  the  feature  and  babble  codebooks,  top-level  recogniser  syntax  file 
(files/main.non),  a  list  of  fragments  to  be  counted  and  a  file  containing  a  mapping  from 
language  to  CD-ROM  drive  containing  the  corresponding  speech  data. 


Intermediate File(s):


None 


Output  File(s): 

Recogniser configuration files; the syntax and semantics of these files are beyond the scope of this
report.
3.9  astrec.rcp

Astrec.rcp is a wrapper which provides the arguments for the speech recognition software. The
script first generates a speechlist, using the target language to identify the machine which contains
the speech data and thereby the full pathname of each input file. The location of the speechlist and
various configuration files are then simply passed on to the recognition software. The output from
the recogniser is converted into the format expected by subsequent stages. This involves converting
each recogniser .res file into a .cnt file containing a count reflecting the duration of the file (in
this case the number of VQ symbols) followed by a count of the number of occurrences of each
fragment.

Usage: 

astrec.rcp  <target>  -  Target  language  (e.g.  english) 

<codebook>  -  Codebook  size  (e.g.  0128) 

<speechlist>  -  List  of  VQ'd  speech  files 

<fragments>  -  Fragments  to  be  counted 

Input  File(s): 

The input includes the configuration files generated by the previous script (count.rcp) as well as
the  main  speechlist  file  which  lists  the  set  of  training  or  test  files. 

Intermediate File(s):

The script generates a speechlist for the recogniser which lists the location of each speech file.
Separate recognition output files (extension .res) are also generated for each input file and these
are stored in the directory $dstdir. Each file consists of a table containing five columns. The first
two columns give the starting and ending frames (where the frame rate is typically 100 frames per
second) in the speech signal where the fragment was spotted. The third and fourth columns give the
negative log likelihood score of the detected fragment followed by the average score per frame.
Finally, the last column identifies the fragment that was spotted; the number refers to the ranking of
the fragment in the input file $fragments which contains the list of extracted VQ sequence
fragments.

Output  File(s): 

The output is a set of count files (extension .cnt) which, for each recognition results file, contains
a  count  of  the  total  number  of  VQ  symbols  in  that  file  (i.e.  a  measure  of  the  file  length)  followed  by 
occurrence  counts  for  each  fragment  in  turn. 
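
The conversion from a recognition results table to a count vector can be sketched as below. It assumes whitespace separated columns and that the fifth column is a 1-based integer rank into the fragment list; neither detail is confirmed by this manual, and the exact layout of the count file on disk is not modelled. The function name is hypothetical.

```python
def count_fragments(res_lines, n_fragments: int, n_symbols: int) -> list[int]:
    """Build a count vector from the rows of a recognition results file.

    The returned list starts with the number of VQ symbols in the file
    (its duration) followed by one occurrence count per fragment.
    """
    counts = [0] * n_fragments
    for line in res_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        rank = int(fields[4])  # assumed 1-based fragment rank
        if not 1 <= rank <= n_fragments:
            raise ValueError(f"fragment rank out of range: {rank}")
        counts[rank - 1] += 1
    return [n_symbols] + counts
```

The start frame, end frame and score columns are ignored here because only the fragment identity contributes to the counts.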
3.10 select.rcp


Select.rcp uses the .cnt files generated by count.rcp and astrec.rcp to rank and select a
subset of the most discriminative fragments. The individual counts for each training file are
gathered together and used to generate an intermediate file which contains the total number of
occurrences of each fragment per language. The values in this file are used to compute
discriminative ‘usefulness’ scores for each of the fragments. A ranked listing of the fragments and
their scores is written to the output file.


Usage:

select.rcp <target> - Target language (e.g. english)

<fragments> - Fragments set to select from

<index> - Mapping from fragments to counts

<mode> - Selection mode (usefulness etc.)

<selection> - File to store selection


Input  File(s): 

The input consists of a list of fragments, a second list of fragments which is used to map the
values in the count (.cnt) files back onto the original VQ sub-sequences, and finally the directory
containing the .cnt files.

Intermediate File(s):

The individual .cnt files are combined into a single file which gives the accumulated counts for
each language in turn. The file $tmpdir/ccounts.tmp consists of a list of languages to be
classified, the estimated length of the language-specific training data and the number of occurrences
of each fragment in each language. These counts are obtained by simply accumulating the counts
from the individual training files.
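
The accumulation step amounts to an element-wise sum over the training files of each language. A minimal sketch is given below, assuming each count record has already been read into a list of the form [file length, count 1, ..., count N]. The function name accumulate and the in-memory representation are illustrative assumptions, not the layout of ccounts.tmp.

```python
def accumulate(counts_by_file):
    """Sum per-file count records into one record per language.

    counts_by_file: iterable of (language, record) pairs, where record is
    [file_length, count_1, ..., count_N] (assumed in-memory form of a count file).
    """
    totals = {}
    for language, record in counts_by_file:
        if language not in totals:
            totals[language] = [0] * len(record)
        # Element-wise sum: lengths add up, and so does each fragment count.
        totals[language] = [a + b for a, b in zip(totals[language], record)]
    return totals

training = [
    ("english", [100, 2, 0]),
    ("english", [50, 1, 1]),
    ("french", [80, 0, 3]),
]
print(accumulate(training))
```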

Output  File(s): 

The  output  is  a  single  file  containing  the  fragments  ranked  by  their  discriminative  usefulness  scores. 
The  first  column  is  the  frequency  weighted  score  which  is  then  followed  by  the  incremental  score 
and  finally  the  VQ  sub-sequence  to  which  the  scores  relate. 
3.11 classify.rcp

Classify.rcp is used to classify the .res files according to target language. The input to the process
consists of the .cnt files generated by counting occurrences of the selected fragments in the test
data and the file containing the list of selected fragments with their discriminative usefulness scores.

The occurrence counts in each .cnt file are normalised by the length of the file and then used
along with the incremental ‘usefulness’ scores to calculate the total score for each file. The output is
a file containing the scores for each test file.

Usage: 

classify.rcp  <target>  -  Target  language  (e.g.  english) 

<speechlist>  -  List  of  test  files 

<selection>  -  Selected  fragments 

<output>  -  Language  classifications 

Input  File(s): 

The input files consist of a speechlist which gives the location of the test files and from which the
location of the .cnt files can be determined. A separate file, generated by select.rcp, contains
the list of fragments and their associated usefulness scores.

Intermediate File(s):

None 

Output  File(s): 

The  output  is  a  single  file  which,  for  each  test  file,  lists  the  correct  language  classification,  the 
filename  and  the  accumulated  usefulness  score.  This  score  should  be  large  when  the  test  file  belongs 
to  the  target  language  and  small  or  negative  otherwise. 
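
The scoring step can be illustrated with the sketch below. The manual states only that the counts are normalised by file length and used along with the incremental usefulness scores; the weighted sum used here is an assumption made for illustration, and the function name file_score is hypothetical.

```python
def file_score(record, usefulness):
    """Score one test file against the target language.

    record:     [file_length, count_1, ..., count_N] from a count file.
    usefulness: incremental usefulness score for each selected fragment.
    The combination rule (a weighted sum) is assumed, not taken from the software.
    """
    length, counts = record[0], record[1:]
    if length == 0:
        return 0.0  # an empty file carries no evidence either way
    # Normalise each count by file length, then weight by usefulness.
    return sum(u * c / length for u, c in zip(usefulness, counts))

print(file_score([100, 2, 0, 1], [1.5, -0.5, 2.0]))
```

Under this assumption a file containing many occurrences of fragments with positive usefulness receives a large score, in line with the behaviour described above.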


3.12  score.rcp 


Score.rcp takes the output file generated by classify.rcp and calculates the probability of
detection and probability of false alarm for a range of detection thresholds (the calculations
are performed by a perl script). The starting threshold, increment and final threshold value
are supplied as arguments. It is often necessary to run the scoring script manually with a variety of
different arguments. The data points for the ROC plots displayed previously were generated by
manually adjusting the parameters so as to give twenty data points over the range of the plot.

Usage:

score.rcp <target> - Target language (e.g. english)

<minthreshold> - Initial threshold value

<increment> - Threshold increment

<maxthreshold> - Final threshold value


Input  File(s): 

Classification file produced by classify.rcp

Intermediate File(s):

None

Output File(s):

None; the results are written directly to the display.
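
The threshold sweep can be sketched as follows. This is not the perl script supplied with the software: it assumes that a file is declared a detection when its score is greater than or equal to the threshold, and that the classification file has been read into (true language, score) pairs. The function name roc_points is hypothetical, and both the target and non-target sets are assumed to be non-empty.

```python
def roc_points(scored, target, lo, step, hi):
    """Return (threshold, P_detection, P_false_alarm) for each threshold.

    scored: list of (true_language, score) pairs.
    A file counts as detected when score >= threshold (assumed convention).
    """
    targets = [s for lang, s in scored if lang == target]
    others = [s for lang, s in scored if lang != target]
    n_steps = int(round((hi - lo) / step))
    points = []
    for i in range(n_steps + 1):
        threshold = lo + i * step
        p_det = sum(s >= threshold for s in targets) / len(targets)
        p_fa = sum(s >= threshold for s in others) / len(others)
        points.append((threshold, p_det, p_fa))
    return points

scored = [("english", 0.8), ("english", 0.2), ("french", 0.5), ("french", -0.1)]
for point in roc_points(scored, "english", 0.0, 0.5, 1.0):
    print(point)
```

Choosing the increment as (final - initial) / 19 gives the twenty data points mentioned above.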
3.13 common.args


Usage:

Common.args is used as a container for common variables and definitions that are used by a
number of other scripts; it is not called in its own right.

The script defines values for the datatype of the VQ speech files, the locations of the VQ symbol set
and distance matrix, the DP deletion, insertion and substitution penalties, and the similarity
threshold used for detecting partial matches and clustering.

Input  File(s): 

None 

Intermediate File(s):

None 

Output  File(s): 

None 
4 Adapting to new data / languages

4.1  Vector  quantiser  component 

4.1.1 The main task in adapting to new data and/or languages is the generation of new
speech data and annotation files. The speech data files contain a spectrogram-type
representation of the speech signal and can be generated using either a filterbank or a fast
Fourier transform (FFT) algorithm. The annotation files indicate regions of the speech data
files that are to be used for training or testing.

4.1.2  Additionally,  speechlist  files  will  have  to  be  generated  which  give  the  locations  of  the 
speech  data  and  annotation  files.  These  files  are  simple  text  files  which  can  be  most  easily 
generated  by  modifying  existing  files  such  as  those  included  with  the  installation. 

4.2  Sequence  analysis  component 

4.2.1 In theory, changing the training and test data should require few or no changes to the
individual scripts. The codebooks will need to be generated, the speech data vector
quantised and then the corresponding files copied to the directory VQdata as described in
section 3.2.

4.2.2 At the head of tlid.rcp there is a string containing the set of languages in the training
and test data. If the new set differs from that used for OGI (i.e. english, farsi, french,
german, japanese, korean, mandarin, spanish, tamil and vietnamese) then this string will
need to be changed.

4.2.3 The text file ‘files/drives.txt’ contains a mapping between each language and
the name of the machine holding the CD-ROM with the speech data for that language.
Again, if the language set differs from OGI then this file will need updating.

4.2.4 It may also be necessary to optimise the various parameters for the new data set; these
parameters are described in section 3.4.
5  File  Locations 

The following file structure shows the location of various files relative to the top-level directory in
which the files are installed. Any changes to this file structure will be recorded in the README
file in the top-level directory.

VQdata/                            - Contains data from VQ training
  codebooks/                       - Contains VQ codebooks
    english/                       - Contains english codebooks
      english.bk.0002.Z            - Compressed 2 entry codebook
      english.bk.0128.Z            - Compressed 128 entry codebook
  ppfiles/                         - Contains recogniser pre-processor files
  sequences/                       - Contains vector quantised training data
    english.bk.0128/               - Output from 128 element VQ codebook
  speechlists/                     - Contains training and test set lists
    train_dev_<language>.tsl       - Development training data
    test45_dev_<language>.tsl      - 45 second development test
    test10_dev_<language>.tsl      - 10 second development test
    test45_eval_<language>.tsl     - 45 second evaluation test
    test10_eval_<language>.tsl     - 10 second evaluation test

bin/                               - Contains various executables
  astrec_2.4.3                     - Continuous speech recogniser
  cb2hmm                           - Converts codebook to HMM
  ccount_1.0.1                     - Collects observation counts
  choose_1.0.6                     - Selects most 'Useful' fragments
  dpcluster_1.0.2                  - Clusters similar fragments
  expand_1.0.0                     - Expands observation counts
  file2file_1.0.3                  - Sequence analysis software
  gproc_1.0.0                      - Generates recogniser syntax files
  modgen_1.0.0                     - Generates recogniser vocabulary file
  uscore_1.0.2                     - Calculate 'Usefulness' scores

files/                             - Contains various pre-prepared files
  main.non                         - Recogniser syntax file
  babble.pre                       - Recogniser syntax file
  drives.txt                       - Location of CDs
results/                           - Contains experimental results
  <language>/                      - Target language (e.g. english)
    train/                         - Output from training phase
    test/                          - Output from test phase
    tmp/                           - Miscellaneous files

scripts/                           - Contains shell and perl scripts
  analyse.rcp                      - Extract potentially useful fragments
  astrec.rcp                       - Count fragment occurrences
  ccounts.rcp                      - Accumulate occurrence counts
  classify.rcp                     - Classify test data
  cluster.rcp                      - Cluster similar fragments
  common.args                      - Common arguments
  count.rcp                        - Count fragment occurrences
  gensplist.prl                    - Generate speechlist for recogniser
  score.rcp                        - Iterates threshold
  sed.cmd                          - Commands for 'sed'
  select.rcp                       - Select most 'useful' fragments

src/                               - Contains source files
  rlabs/                           - Source developed for Rome Lab.
    expand.cpp                     - Expands observation counts
    gproc.cpp                      - Generates recogniser syntax files
    modgen.cpp                     - Generates recogniser vocabulary file

test/                              - Contains example results
  english/                         - Target language (english)
    train/                         - Output from training phase
    test/                          - Output from test phase
    tmp/                           - Miscellaneous files
A. Binary File Headers

A.1.1 The vector quantiser software operates on binary speech data and annotation files which
will need to be generated if the software is to be used on new data. Each binary data file
has a header which describes the data that follows. The header is identified by the
characters ‘SRUHEADO’ and contains a number of fields as shown below.


char   ident[8];        - Header identifier
int32  byt_per_frame;   - Number of bytes per frame
int32  file_type;       - File type
int32  data_type;       - Data type
int32  res_len;         - Not used
int32  data_len;        - Length of data
int32  samplerate;      - Sampling rate
int32  downsample;      - Downsample rate
float  max_val;         - Maximum data value
float  max_scale;       - Maximum data scale
float  min_val;         - Minimum data value
float  min_scale;       - Minimum data scale
int32  no_coms;         - Number of comments
int32  com_len;         - Length of comments
int32  ex_head_len;     - Length of extra data
int32  pad_len;         - Length of padding data
int32  offset;          - Offset relative to signal file
int32  machine;         - Machine formats flag
int32  reserved;        - Reserved

A.1.2 The meanings of those fields that are important for subsequent software and typical values
for speech data and annotation files are given in appendices A.2 and A.3.
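
For readers generating or checking files for new data, the header can be read with a few lines of code. The sketch below assumes that the fields are stored in the order listed, packed without padding, in little-endian byte order with 4-byte integers and floats. None of this is stated explicitly in the manual, although the resulting size of 80 bytes agrees with the typical headers in A.2 and A.3 (80 + 176 + 0 + 256 = 512 and 80 + 268 + 224 + 452 = 1024). The names HEADER_FORMAT, FIELDS and parse_header are hypothetical.

```python
import struct

# Assumed layout: 8 identifier bytes, 7 int32, 4 float32, 7 int32, little-endian.
HEADER_FORMAT = "<8s7i4f7i"
FIELDS = (
    "ident", "byt_per_frame", "file_type", "data_type", "res_len",
    "data_len", "samplerate", "downsample",
    "max_val", "max_scale", "min_val", "min_scale",
    "no_coms", "com_len", "ex_head_len", "pad_len",
    "offset", "machine", "reserved",
)

def parse_header(raw):
    """Decode the fixed header from the first bytes of a binary data file."""
    size = struct.calcsize(HEADER_FORMAT)  # 80 bytes under the assumptions above
    header = dict(zip(FIELDS, struct.unpack(HEADER_FORMAT, raw[:size])))
    header["ident"] = header["ident"].decode("ascii", errors="replace")
    return header

# Round trip using the typical speech data values quoted in this appendix.
raw = struct.pack(HEADER_FORMAT, b"SRUHEADO", 80, 4, 11, 0, 398240, 8000, -80,
                  -1.0, 1.0, 1.0, 1.0, 4, 176, 0, 256, 0, 0x311, 0)
header = parse_header(raw)
print(header["ident"], header["data_len"], header["pad_len"])
```

Files written on a machine with a different byte order would need the leading "<" changed to ">"; the machine field is presumably intended to signal this.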


A.2  Typical  speech  data  file  header 

A.2.1  The  following  represents  a  typical  header  from  a  speech  data  file 


identifier             SRUHEADO
byte per frame         80
file type              4 (filter
data type              11 (SRUba
data len               398240
samplerate             8000
downsampled            -80
offset                 0
max_val                -1.000000
max_scale              1.000000
min_val                1.000000
min_scale              1.000000
number of comments     4
comment length         176
extra header length    0
padding length         256
architecture bits      0x311 (DECstation, little-endian, IEEE float 32)
A.2.2 It can be seen from the header that the data in this file consists of log filterbank
coefficients (file type 4), and the coefficients are stored as unsigned characters scaled in
0.5 dB steps (data type 11). The original speech was sampled at 8 kHz but the frame rate of
the coefficients is 100 Hz, hence the value of the downsampled field. The total size of the
filterbank coefficients is 398240 bytes (this corresponds to 4978 frames and 49.8 seconds
of speech at 100 frames per second).

A.2.3 There are four comments (not displayed) which are contained in 176 bytes following the
header and there is no extra header information. In order for the total size of the header
and comments to be a multiple of 512 bytes it was necessary to append 256 padding
characters.
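
The frame count and the padding length can be checked with a short calculation. The sketch assumes a fixed header of 80 bytes (8 identifier characters plus 18 four-byte fields), which the manual does not state explicitly; the function names are hypothetical.

```python
HEADER_BYTES = 80  # assumed size of the fixed header
BLOCK = 512        # header, comments, extra data and padding fill whole blocks

def frame_count(data_len, byt_per_frame):
    """Number of fixed-width frames held in the data section."""
    return data_len // byt_per_frame

def padding_needed(com_len, ex_head_len, header_bytes=HEADER_BYTES, block=BLOCK):
    """Bytes required to round the header area up to a multiple of the block size."""
    used = header_bytes + com_len + ex_head_len
    return (-used) % block

print(frame_count(398240, 80))  # frames in the speech data example
print(padding_needed(176, 0))   # padding for the speech data example
```

For the speech data header these give 4978 frames and 256 padding bytes, matching the padding length field in the listing. When the header area is already an exact multiple of 512 bytes this sketch returns zero; whether the file format does the same is not known.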

A.2.4 The remaining fields such as maximum and minimum values are not used at present and
were not computed.


A.3  Typical  annotation  file  header 


A.3.1  The  following  represents  a  typical  annotation  file  header 


identifier             SRUHEADO
byte per frame         76
file type              12 (annotation)
data type              22 (Sentence annotation)
data len               2660
samplerate             19980
downsampled            -1
offset                 0
max_val                3546.000000
max_scale              1.000000
min_val                94.000000
min_scale              1.000000
number of comments     6
comment length         268
extra header length    224
padding length         452
architecture bits      0x12 (presumed VAX, little-endian, VAX float 32)


A.3.2 The header shows that this file contains annotation data (file type 12) at the sentence level
(data  type  22).  The  annotation  data  is  stored  in  fixed  width  ‘frames’  which  in  this  case  are 
76  bytes  long.  The  total  data  length  is  2660  bytes  corresponding  to  35  annotations.  The 
signal  data  was  sampled  at  19.98  kHz  and  the  ‘start’  and  ‘end’  values  of  the  annotation  tags 
also  refer  to  the  same  sampling  rate  (downsampled  =  -1). 

A.3.3 The remaining fields are as described for the previous header. Each annotation tag is
stored as two 32 bit integers for the start and end points followed by annotation text. The
total length of the integers and text should equal byt_per_frame.
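
The record layout described above can be decoded as in the following sketch. It assumes little-endian integers and annotation text padded to the full record width with NUL or space characters; neither detail is confirmed by the manual, and the function name parse_annotations is hypothetical.

```python
import struct

def parse_annotations(data, byt_per_frame):
    """Split the data section of an annotation file into (start, end, text) tags.

    Each record is two 32 bit integers followed by text, byt_per_frame bytes in all.
    Byte order and text padding are assumptions.
    """
    text_len = byt_per_frame - 8  # bytes left for text after the two integers
    record = struct.Struct("<ii%ds" % text_len)
    tags = []
    for pos in range(0, len(data) - byt_per_frame + 1, byt_per_frame):
        start, end, text = record.unpack_from(data, pos)
        label = text.rstrip(b"\x00 ").decode("ascii", errors="replace")
        tags.append((start, end, label))
    return tags

# Two invented 76-byte records; start and end are in samples at 19.98 kHz.
data = (struct.pack("<ii68s", 0, 19980, b"hello")
        + struct.pack("<ii68s", 19980, 39960, b"world"))
print(parse_annotations(data, 76))
```

With 76-byte records, the 2660 bytes of the typical annotation file decode to 35 tags.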